Bayesian Scoring Procedures for Computerized Adaptive
Tests,"  by  D.  R.  Divgi,  August  1987 

1.  Enclosure  (1)  is  forwarded  as  a  matter  of  possible  interest. 

2.  A  computerized  adaptive  testing  (CAT)  version  of  the  Armed  Services 
Vocational  Aptitude  Battery  (ASVAB)  is  being  developed  for  joint-service 
use by the Navy Personnel Research and Development Center (NPRDC).

There  are  different  ways  of  computing  an  examinee's  CAT  score.  This 
Research  Memorandum  compares  three  Bayesian  scoring  procedures  - 
posterior  mean,  posterior  mode,  and  Owen's  approximation  -  in  terms  of 
their  reliabilities  and  of  their  sensitivity  to  changes  in  item 
parameters  from  paper-pencil  to  CAT  administration. 
ABSTRACT


The  computerized  adaptive  version  of  the  Armed 
Services  Vocational  Aptitude  Battery  will  use  a 
Bayesian procedure for computing test scores. Properties
of three common Bayesian procedures are examined
in  this  research  memorandum.  The  results  show  that  the 
procedures  are  almost  equally  reliable  and  that  reliability 
drops  if  item  parameters  change  from  paper-pencil  to 
computerized  administration. 


EXECUTIVE  SUMMARY 


INTRODUCTION 

The  Department  of  Defense  is  developing  a  computerized  adaptive  testing  (CAT) 
version  of  the  Armed  Services  Vocational  Aptitude  Battery  (ASVAB).  In  CAT,  each 
examinee is characterized by a value of ability, θ; each item is characterized by three
parameters  which  represent  discriminating  power,  difficulty,  and  the  effect  of  guessing. 
An  experimental  version  of  CAT-ASVAB  has  been  developed  and  was  administered  to 
recruits  from  all  services  in  a  study  of  CAT  validity. 

The  prior  distribution  and  an  examinee’s  item  responses  together  provide  the 
posterior  distribution  of  that  individual’s  ability.  Different  scoring  procedures  (called 
“estimators”  in  statistics)  can  be  used  for  calculating  an  estimate  of  the  examinee’s 
ability.  The  purposes  of  this  research  memorandum  are  to  distinguish  between  theoretical 
and  practical  criteria  for  choosing  among  estimators  and  to  evaluate  the  psychometric 
properties  of  three  procedures. 


THEORETICAL  vs.  PRACTICAL  CRITERIA 

Some  researchers  have  recommended,  on  theoretical  grounds,  that  the  mean  of  the 
posterior  distribution  be  used  as  the  ability  estimate.  Their  criterion  for  evaluating  an 
estimator  is  its  mean  squared  error  (MSE) — that  is,  the  average  of  the  squared  difference 
between the true θ and its estimate. In practice, the MSE criterion is irrelevant. The
goal of the CAT-ASVAB and of the paper-pencil (PP) ASVAB is not to estimate a
parameter θ in a model but to predict future performance. Therefore, CAT-ASVAB will
be  evaluated  in  the  long  run  on  the  basis  of  its  predictive  validity.  In  the  short  run,  it  will 
be  judged  by  the  reliabilities  of  the  CAT-ASVAB  scores  that  are  used  for  selection  and 
classification.  In  particular,  CAT  subtests  should  be  at  least  as  reliable  as  their  PP 
counterparts.  Reliability  and  validity  of  CAT-ASVAB  may  suffer  if  an  irrelevant 
criterion  is  used  to  select  the  scoring  procedure. 


METHODOLOGY 

Three  Bayesian  estimators  were  evaluated  using  simulations — that  is,  computer 
generation  of  examinees’  abilities  and  item  responses.  One  estimator  was  the  posterior 
mean. The second estimator was the mode of the posterior distribution, which is frequently
used because it is easier to compute than the mean. The third estimator was Owen’s
approximation  which,  despite  its  simplicity,  is  known  to  yield  reasonable  estimates. 
MSEs  as  well  as  reliabilities  were  computed  for  all  three  estimators. 

The  simulation  imitated  the  experimental  CAT-ASVAB  as  far  as  possible,  using 
the  same  item  parameters  and  item  selection  algorithm.  The  standard  normal  distribution 
was  used  as  the  prior  distribution.  The  true  distribution  of  ability  was  taken  to  be  normal, 
with  mean  and  variance  equal  to  estimates  based  on  the  recruit  sample.  Each  simulated 
examinee  was  administered  10  items  in  Paragraph  Comprehension  and  15  in  each  of  the 
other  subtests. 

In  the  experimental  CAT-ASVAB  project,  item  parameters  were  estimated  from  a 
PP  administration  of  the  item  pool  and  then  used  in  CAT.  (The  same  procedure  is  being 
followed  in  the  Accelerated  CAT-ASVAB  Project.)  The  implied  assumption  is  that  the 
parameters  are  not  affected  by  the  medium  of  administration.  This  assumption  is  known 
to  be  false.  Its  violation  may  affect  different  estimators  to  different  degrees.  Therefore  a 
second  simulation  was  performed.  The  same  item  parameters  as  in  the  first  simulation 
were  used  for  item  selection  and  to  calculate  all  ability  estimates.  However,  while 
generating  examinees’  responses,  probabilities  of  correct  answers  were  computed  using 
CAT-based  parameter  values  obtained  in  an  earlier  CNA  study. 

RESULTS 

The  posterior  mean,  posterior  mode,  and  Owen’s  approximation  were  found  to  be 
almost  equally  reliable  (see  table  I).  Results  of  the  second  simulation  were  similar  to 
those  of  the  first  in  that  the  three  estimators  were  about  equally  reliable.  Thus,  the 
theoretical  superiority  of  the  posterior  mean  does  not  translate  into  a  higher  reliability 
than  that  of  the  posterior  mode.  Although  Owen’s  estimator  is  equally  reliable,  there  is 
no  justification  for  using  an  approximation  when  an  estimate  based  on  the  correct 
posterior  distribution  can  be  calculated.  Thus,  the  results  support  using  the  posterior 
mode  because  it  is  easier  to  calculate. 

Another  finding  from  the  second  simulation  was  that  changes  in  item  parameters 
from PP to CAT noticeably reduced reliability. The decreases in reliabilities are presented
in table II, where the degree of change in item parameters is indicated by the mean
average absolute difference (AAD) between the item characteristic curves in PP and CAT
administrations. When a new CAT is being developed, the size of this change is unknown
and hence simulations cannot allow for it. As a result, these simulations overestimate
CAT reliability.


TABLE I

RELIABILITIES OF SUBTESTS WHEN SCORES ARE COMPUTED
USING POSTERIOR MEAN, POSTERIOR MODE,
AND OWEN’S APPROXIMATION

                              Subtest
Estimator    GS    AR    WK    PC    AI    SI    MK    MC    EI

Mean        .884  .899  .895  .775  .887  .900  .915  .841  .863
Mode        .884  .898  .894  .773  .886  .899  .914  .839  .862
Owen        .883  .896  .892  .777  .885  .898  .911  .840  .861

TABLE II

SIZE OF CHANGE IN ITEM PARAMETERS AND CONSEQUENT DECREASE
IN RELIABILITY

                                     Subtest
                           GS    AR    WK    PC    AI    SI    MK    MC    EI

Mean AAD                  .050  .047  .048  .051  .061  .079  .064  .089  .071
Decrease in reliability   .053  .048  .047  .050  .013  .035  .048  .061  .049


CONCLUSIONS 

•  Criteria  behind  technical  decisions  should  be  based  on  the  way  CAT- 
ASVAB  will  be  used  and  evaluated,  not  on  abstract  theoretical  principles. 

•  The  mode  of  the  posterior  ability  distribution  is  a  good  scoring  method  for 
CAT-ASVAB. 

•  Because  of  changes  in  item  parameters  from  PP  to  CAT  administration, 
reliabilities  of  CAT-ASVAB  subtests  will  almost  certainly  be  lower  than 
the  values  obtained  in  simulations. 
INTRODUCTION


Within  a  few  years  the  Department  of  Defense  may  begin  administering  the 
Armed  Services  Vocational  Aptitude  Battery  (ASVAB)  using  computerized  adaptive 
testing (CAT). CAT is based on item response theory (IRT). Each examinee is characterized
by a value of ability θ. Each test item is described by an item response curve
which  specifies  how  the  probability  of  correctly  answering  the  item  increases  with  ability. 
The  three-parameter  logistic  model  is  used  in  the  CAT-ASVAB  project.  In  this  model, 
the  probability  of  a  correct  answer  is  given  by 

P(θ) = c + (1 - c) / [1 + exp{1.7a(b - θ)}] ,

where  a,  b,  and  c  are  the  discrimination,  difficulty,  and  guessing  parameters  of  the 
item. 
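
As a minimal sketch of the three-parameter logistic model just stated, the function below evaluates the probability of a correct answer. The function name and the item parameter values are illustrative assumptions, not values from the CAT-ASVAB item pool.

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer.

    theta: examinee ability; a: discrimination; b: difficulty; c: guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


# Hypothetical item (not from the CAT-ASVAB pool): a = 1.2, b = 0.0, c = 0.2.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_correct(theta, 1.2, 0.0, 0.2), 3))
```

At theta equal to b the probability is c + (1 - c)/2; it approaches c for very low ability and 1 for very high ability.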


In  the  experimental  CAT-ASVAB  [1],  which  was  administered  to  recruits  in  all 
services  in  the  CAT  validity  study,  adaptive  testing  begins  with  a  highly  discriminating 
item  of  medium  difficulty,  selected  at  random  from  five  such  items.  The  examinee’s 
answer is used to estimate ability, θ. This estimate is used to select the next item to be
administered, after which θ is reestimated, and so on. Testing continues until a
prespecified  number  of  items  has  been  administered. 

A Bayesian procedure is used to update information about θ. One begins with an
assumed prior distribution of ability. After the first item the distribution is multiplied by
the probability of the examinee’s response (P(θ) for a correct answer, 1 - P(θ) for a
wrong one). The product is the posterior distribution of θ (except for a constant factor
which is of no consequence). The posterior distribution after the first item is the prior
distribution for the second item. When it is multiplied by the probability of the response
on the second item, one obtains a new posterior distribution which yields the next estimate
of θ. Such sequential updating is continued until a prespecified number of items
has  been  administered. 
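
The sequential updating described above can be sketched numerically on a grid of ability values. This is an illustration only: the grid range and spacing, the function names, and the three item parameter sets are assumptions, and the experimental system's own quadrature is not described in this memorandum.

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


def make_grid(lo=-4.0, hi=4.0, n=161):
    """Equally spaced ability values (assumed grid, not the CAT-ASVAB one)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def normal_prior(grid, mean=0.0, sd=1.0):
    """Normal prior evaluated on the grid and normalized to sum to 1."""
    w = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in grid]
    total = sum(w)
    return [x / total for x in w]


def update(weights, grid, item, correct):
    """Multiply the current distribution by the response probability."""
    a, b, c = item
    new = []
    for w, t in zip(weights, grid):
        p = p_correct(t, a, b, c)
        new.append(w * (p if correct else 1.0 - p))
    total = sum(new)  # the constant factor "of no consequence"
    return [x / total for x in new]


def posterior_mean(weights, grid):
    return sum(w * t for w, t in zip(weights, grid))


def posterior_mode(weights, grid):
    return grid[max(range(len(grid)), key=lambda i: weights[i])]


grid = make_grid()
post = normal_prior(grid)
# Hypothetical (a, b, c) items and responses.
responses = [((1.5, 0.0, 0.2), True), ((1.2, 0.6, 0.2), False), ((1.0, 0.3, 0.25), True)]
for item, correct in responses:
    post = update(post, grid, item, correct)
    print(round(posterior_mean(post, grid), 3), round(posterior_mode(post, grid), 3))
```

Because the posterior is a product of the prior and the response probabilities, the final distribution does not depend on the order in which the items were answered.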


COMPUTING  EXAMINEE’S  SCORE 

The  exact  posterior  distribution  requires  extensive  calculations.  When  a 
microcomputer  is  used  to  administer  CAT,  these  calculations  may  take  long  enough  for 
the  examinee  to  notice  the  delay  in  administering  the  next  item.  Therefore  Owen’s 
approximation  [2]  was  used  in  the  experimental  CAT-ASVAB  and  will  be  used  in  the 
Accelerated CAT-ASVAB Project [3 (enclosure 3.13, item B3)]. Owen’s procedure
begins with a normal prior distribution. After each item the correct posterior distribution
is  replaced  by  a  normal  distribution  with  the  same  mean  and  variance,  which  can  be 
computed  using  relatively  simple  formulas.  The  mean  is  used  as  the  ability  estimate  for 
choosing  the  next  item. 

The  primary  shortcoming  of  Owen’s  estimate  is  that  it  depends  on  the  order  in 
which  items  are  administered  [4;  5  (enclosure  3.3)].  If  two  persons  answer  the  same 
items  the  same  way  but  in  different  orders,  their  Owen  estimates  will  not  be  exactly 
equal.  This  is  not  important  as  long  as  the  estimate  is  used  only  to  select  the  next  item. 
However, the final ability estimate following the last item, after appropriate transformation,
becomes the examinee’s score on the test. It should be independent of the item
order,  which  is  the  case  with  estimates  based  on  the  correct  posterior  distribution. 
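
The order dependence can be illustrated with a numerical stand-in for Owen's procedure. The sketch below does not use Owen's closed-form formulas [2]; as an assumption made for illustration, it computes the posterior mean and variance on a grid after each item and then replaces the posterior with a normal distribution having those moments. The items and responses are hypothetical.

```python
import math


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


GRID = [-5.0 + 0.05 * i for i in range(201)]  # assumed quadrature grid


def normal_weights(mean, var):
    w = [math.exp(-0.5 * (t - mean) ** 2 / var) for t in GRID]
    total = sum(w)
    return [x / total for x in w]


def times_likelihood(weights, item, correct):
    a, b, c = item
    out = []
    for w, t in zip(weights, GRID):
        p = p_correct(t, a, b, c)
        out.append(w * (p if correct else 1.0 - p))
    total = sum(out)
    return [x / total for x in out]


def moments(weights):
    mean = sum(w * t for w, t in zip(weights, GRID))
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID))
    return mean, var


def exact_mean(responses):
    """Posterior mean from the exact (unapproximated) posterior."""
    w = normal_weights(0.0, 1.0)
    for item, correct in responses:
        w = times_likelihood(w, item, correct)
    return moments(w)[0]


def moment_matched_mean(responses):
    """After each item, replace the posterior by a normal with the same moments."""
    mean, var = 0.0, 1.0
    for item, correct in responses:
        mean, var = moments(times_likelihood(normal_weights(mean, var), item, correct))
    return mean


responses = [((1.5, -0.5, 0.2), True), ((1.0, 1.0, 0.2), False)]
swapped = list(reversed(responses))
print("exact:         ", exact_mean(responses), exact_mean(swapped))
print("moment-matched:", moment_matched_mean(responses), moment_matched_mean(swapped))
```

The exact posterior mean agrees for both orders up to rounding, whereas the normal replacement is applied to a different intermediate posterior in each order, so the two approximate estimates need not agree.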

It  is  highly  improbable,  but  not  impossible,  for  two  persons  to  be  administered  the 
same  items  in  different  orders.  However,  the  very  possibility  is  enough  to  decide  the 
issue.  The  only  justification  for  using  Owen’s  approximation  is  that  it  can  be  computed 
much faster than any estimate based on the correct posterior distribution. This is important
during item selection because one must not make the examinee wait too long while
the interim ability estimate is being computed. However, once a subtest has been completed,
there is no urgency about starting the next one. There is enough time to use the
correct posterior distribution. Thus, there is no argument in favor of Owen’s approximation
as the final ability estimate. Hence its dependence on item order, although trivial in
its  impact  on  examinees,  suffices  to  rule  it  out  for  the  final  estimate. 

The  two  popular  Bayesian  estimators  are  the  mean  and  the  mode  of  the  posterior 
distribution  of  ability.  However,  they  cannot  be  reported  to  test  users.  The  ASVAB  has 
its  own  score  scale  based  on  Form  8a,  and  it  must  be  used  to  report  CAT-ASVAB  scores. 
Therefore,  each  CAT-ASVAB  subtest  score  will  be  equated  to  an  8a  score.  As  the  first 
step  in  this  equating,  the  ability  estimate  will  be  converted  into  the  expected  number  right 
score  on  Form  8a  [3  (enclosure  3.13,  item  E2.1)].  The  objective  of  this  paper  is  to 
compare the mode and the mean as estimators, in both θ and number-correct metrics.
Owen’s estimate, in the θ metric only, is included because it was used as the final score
in the experimental CAT-ASVAB system [1 (p. 5)].

Before  one  can  evaluate  and  compare  estimators  (i.e.,  procedures  for  scoring  the 
test),  one  must  choose  a  criterion.  A  criterion  based  on  statistical  decision  theory  differs 
from one based on practical and psychometric considerations. The distinction is explained
in detail below, because if the choice of a scoring procedure is based on an
irrelevant  criterion,  the  usefulness  of  CAT-ASVAB  may  suffer. 
THEORETICAL vs. PRACTICAL CRITERIA


Let θ̂ be an estimator of θ. E(θ̂ | θ) represents its expected (i.e., mean) value in a
subpopulation of examinees, all of whom have ability θ. In general, this mean does not
equal θ. The difference is bias B, which depends on θ. Thus, one can write

θ̂ = θ + B + e , (1)

where e is random error. The mean of e is zero at each value of θ, but its variance and the
shape of its distribution may depend on θ. The error of estimation is (B + e) and the
mean squared error over the entire population of examinees is

MSE(θ̂) = E(B²) + Var(e) .
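
The step from equation 1 to this decomposition is short and can be written out. The notation follows the text (bias B depends only on ability, and the random error e has mean zero at each ability value); the derivation itself is added here for clarity and is not quoted from the source.

```latex
\begin{align*}
\mathrm{MSE}(\hat\theta)
  &= E\big[(\hat\theta - \theta)^2\big] = E\big[(B + e)^2\big] \\
  &= E(B^2) + 2\,E(Be) + E(e^2) \\
  &= E(B^2) + \mathrm{Var}(e),
\end{align*}
% The cross term vanishes because B is a function of \theta alone:
%   E(Be) = E\big[ B \, E(e \mid \theta) \big] = 0,
% and E(e^2) = \mathrm{Var}(e) because E(e) = 0.
```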

Bock  and  Mislevy  [6]  and  Sympson  [7]  have  argued  that  the  mean  of  the  posterior 
distribution  should  be  used  for  estimating  ability  in  CAT,  because  it  is  the  estimator  with 
the  smallest  MSE.  The  argument  is  invalid  for  two  reasons.  First,  it  is  based  on  three 
assumptions:  (1)  the  three-parameter  model  is  correct;  (2)  the  item  parameters  are  known 
exactly; and (3) the prior distribution equals the true distribution of ability in the population.
In practice, all three assumptions are false to some extent.

The  second  reason,  which  is  more  important  than  the  first  one,  is  that  the  MSE 
criterion  is  irrelevant  to  CAT-ASVAB.  The  goal  of  CAT-ASVAB  is  to  predict  future 
performance,  not  to  estimate  a  parameter  in  a  theoretical  model.  Therefore  CAT-ASVAB 
will  be  evaluated  in  the  long  run  on  the  basis  of  its  predictive  validity.  In  the  short  run  it 
will  be  judged  by  the  reliabilities  of  its  scores.  Hence,  in  this  study,  comparisons  of 
estimators  are  based  on  concepts  of  classical  test  theory. 

In  classical  test  theory,  the  score  X  on  a  test  is  an  estimate  of  the  examinee’s  true 
score T which, by definition, is the mean E(X | θ) one would obtain if one could test the
examinee repeatedly. (Equivalently, it is the mean over the subpopulation of examinees
with the same θ.) Therefore the examinee’s true score depends on the procedure used to
score  the  test.  The  difference  between  X  and  T  is  the  error  of  measurement: 

X  =  T  +  e  (2) 


with E(e | T) = 0. The reliability of X is

R = Var(T) / [Var(T) + Var(e)] = 1 - Var(e) / [Var(T) + Var(e)] .

When the score X is an estimate θ̂ of θ,

T = θ + B .

The random error e in equation 1 is the same as the error of measurement e in
equation 2.

Thus,  the  role  of  bias  in  evaluation  of  a  scoring  procedure  depends  on  what  is 
being estimated. If X is considered an estimate of the model parameter θ, B is a part of
estimation error and minimum MSE is a legitimate criterion for choosing among estimators
(as in [6, 7]). If X is considered an estimate of T, B is a part of the examinee’s
true  score  and  the  criterion  is  maximum  reliability.  The  latter,  not  the  former,  is  the  role 
of  the  test  score  in  mental  measurement. 

The  two  criteria,  MSE  and  reliability,  may  yield  different  conclusions.  This  can 
be seen with a trivial example. Suppose θ̂ is the posterior mean. Then 10 times θ̂ has a
much larger MSE. However, since Var(T) and Var(e) both are multiplied by 100, the
reliability of 10θ̂ is the same as that of θ̂. Except in the case of very simple models, reliability
of  an  estimator  cannot  be  calculated  theoretically.  Simulated  or  real  data  are  needed. 
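
The trivial example can be checked with a small simulation. The sketch below is illustrative only: the shrinkage factor of 0.8 and the error standard deviation of 0.4 are invented values standing in for a real estimator, and the function names are assumptions.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def mse_and_reliability(scale, n=20000, seed=1):
    """MSE and reliability of `scale` times a hypothetical shrunken estimator."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise = [rng.gauss(0.0, 0.4) for _ in range(n)]
    # True score T = theta + B with bias B = -0.2 * theta, then rescaled.
    true_scores = [scale * 0.8 * t for t in thetas]
    errors = [scale * e for e in noise]
    scores = [T + e for T, e in zip(true_scores, errors)]
    mse = mean([(x - t) ** 2 for x, t in zip(scores, thetas)])
    reliability = variance(true_scores) / (variance(true_scores) + variance(errors))
    return mse, reliability


for scale in (1, 10):
    print(scale, mse_and_reliability(scale))
```

With the same seed the two runs use identical abilities and errors, so the reliabilities coincide while the MSE of the rescaled score is far larger.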

SIMULATION 

The  simulation  attempted  to  imitate  the  experimental  CAT-ASVAB  as  far  as 
possible.  Item  parameter  estimates  for  the  experimental  CAT-ASVAB  item  pool  were 
used  as  the  true  item  parameters.  The  information  table  contained  37  equally  spaced 
ability  values  from  -2.25  to  2.25.  The  “54321  strategy”  was  used  to  randomize  the 
choice  of  the  first  four  items  in  each  subtest  [1  (p.  A12  and  Supplement,  p.  91)]. 

For  each  subtest,  2,000  abilities  were  sampled  from  a  normal  distribution;  mean 
and  standard  deviation  of  the  normal  were  the  estimates  obtained  for  the  recruit  sample 
that  took  the  experimental  CAT-ASVAB  [8].  However,  in  keeping  with  the  experimental 
system, the standard normal distribution (abbreviated as N(0,1)) was used as the prior
distribution  in  all  calculations. 

The  adaptive  subtests  in  CAT-ASVAB  are  General  Science  (GS),  Arithmetic 
Reasoning (AR), Word Knowledge (WK), Paragraph Comprehension (PC), Auto Information
(AI), Shop Information (SI), Mathematics Knowledge (MK), Mechanical Comprehension
(MC), and Electronics Information (EI). Each examinee was administered
10 items in PC and 15 items in all the other subtests. Posterior mean and mode were
calculated  at  the  end  of  each  subtest. 

As  mentioned  earlier,  CAT-ASVAB  ability  estimates  will  be  converted  into 
expected  number-correct  scores  on  ASVAB  Form  8a.  As  this  transformation  was  not 
available  for  all  subtests,  it  was  imitated  as  follows.  For  each  subtest,  parameters  of  all 
items in the pool were averaged to obtain mean values a, b, and c. These were used to
transform the posterior mean to the percent-correct metric:

P(mean) = 100c + 100(1 - c) / [1 + exp{1.7a(b - mean)}] .

P(mode) was calculated similarly. Two other scores were obtained by transforming the
posterior distribution to the percent-correct metric first, and then computing its mean and
mode. These will be denoted by mean(P) and mode(P). The parameter (not the true
score) being estimated by all four of these scores is P(θ), the percent-correct transform
of θ.
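
The difference between transforming an estimate and transforming the distribution can be sketched as follows. The posterior used here is a made-up normal shape on an assumed grid, and the averaged parameters a = 1.1, b = 0.0, c = 0.2 are hypothetical. The score mode(P) is left out because it needs the posterior density re-expressed in the percent-correct metric, and the memorandum does not say how that was done.

```python
import math


def percent_correct(theta, a, b, c):
    """Percent-correct transform of ability, using averaged item parameters."""
    return 100.0 * c + 100.0 * (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


def summarize(grid, weights, a, b, c):
    """Return P(mean), P(mode), and mean(P) for a posterior given on a grid."""
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_theta = sum(p * t for p, t in zip(probs, grid))
    mode_theta = grid[max(range(len(grid)), key=lambda i: probs[i])]
    return {
        # Transform the ability estimate after it has been computed.
        "P(mean)": percent_correct(mean_theta, a, b, c),
        "P(mode)": percent_correct(mode_theta, a, b, c),
        # Transform every grid point first, then average.
        "mean(P)": sum(p * percent_correct(t, a, b, c) for p, t in zip(probs, grid)),
    }


grid = [-4.0 + 0.05 * i for i in range(161)]
weights = [math.exp(-0.5 * ((t - 0.8) / 0.35) ** 2) for t in grid]  # hypothetical posterior
print(summarize(grid, weights, a=1.1, b=0.0, c=0.2))
```

Because the transform is nonlinear, P(mean) and mean(P) are generally different numbers even though both estimate the same parameter.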

Mean  squared  error  and  correlation  with  the  corresponding  parameter  were 
computed  for  each  score.  For  each  estimator,  true  score  as  a  function  of  the  relevant 
parameter (θ or P(θ)) was estimated by cubic regression. Reliability was estimated as
the  multiple  R-square  of  this  fit  and  then  used  to  calculate  variances  of  true  scores  and 
measurement  errors. 
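
A minimal sketch of this reliability calculation is given below, assuming an ordinary least-squares cubic fit of score on the parameter. The helper names and the simulated data are illustrative assumptions; the original study's software is not described in the memorandum.

```python
import random


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def cubic_fit(xs, ys):
    """Least-squares coefficients [c0, c1, c2, c3] of a cubic in x."""
    powers = [[x ** k for k in range(4)] for x in xs]
    A = [[sum(p[i] * p[j] for p in powers) for j in range(4)] for i in range(4)]
    rhs = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(4)]
    return solve(A, rhs)


def reliability_from_cubic(params, scores):
    """Return (R-square, true-score variance, error variance)."""
    coef = cubic_fit(params, scores)
    fitted = [sum(c * t ** k for k, c in enumerate(coef)) for t in params]
    m = sum(scores) / len(scores)
    ss_total = sum((s - m) ** 2 for s in scores)
    ss_resid = sum((s - f) ** 2 for s, f in zip(scores, fitted))
    r2 = 1.0 - ss_resid / ss_total
    var_score = ss_total / len(scores)
    return r2, r2 * var_score, (1.0 - r2) * var_score


rng = random.Random(3)
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Hypothetical scores: a nonlinear true score plus measurement error.
scores = [0.8 * t - 0.03 * t ** 3 + rng.gauss(0.0, 0.4) for t in thetas]
print(reliability_from_cubic(thetas, scores))
```

The fitted cubic plays the role of the true score T, so the R-square of the fit is the share of score variance attributable to true scores.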


RESULTS 

Table 1 presents results for estimators in the θ metric. Posterior mean does not
have smaller MSE than the mode. This happens because the N(0,1) prior distribution
differs from the marginal distribution, that is, the distribution of ability in the population.
Thus,  the  theoretical  superiority  of  the  posterior  mean  applies  only  when  the  test  is 
administered  to  one  specific  population. 

When  the  classical  criterion  of  reliability  is  used,  the  mean  and  the  mode  are 
found  to  be  almost  equally  good.  Surprisingly,  in  spite  of  the  drastic  approximations 
involved,  Owen’s  estimate  is  practically  as  reliable  as  those  based  on  the  exact  posterior 
distribution. 

The squared correlation between θ and the estimator is often used as the measure
of reliability. Table 1 shows that this underestimates reliability by a small amount. The
difference occurs because bias is a nonlinear function of θ.

To see if the mean has superior reliability when the prior distribution is correct,
that  is,  equals  the  marginal  distribution,  the  three  ability  estimates  were  recomputed  for 
each  examinee  using  the  correct  prior  distribution.  The  results  are  shown  in  table  2.  The 
variance  of  measurement  error  falls  when  the  correct  prior  distribution  is  used,  but  so 
does  the  true-score  variance,  with  the  result  that  reliability  increases  only  slightly.  The 
posterior  mean  does  have  smaller  MSE  than  the  mode,  but  its  reliability  is  not  superior  by 
more  than  .002.  Thus,  like  table  1,  table  2  shows  that  the  three  estimators  are  about 
equally  reliable.  Therefore,  to  keep  the  simulations  realistic,  all  further  calculations  use 
the N(0,1) prior distribution as in the experimental CAT-ASVAB.

Table  3  presents  results  for  scores  in  the  percent-correct  metric.  They  show  the 
same patterns as in table 1. Mean(P), which is the posterior mean computed after transformation
to the percent-correct metric, is slightly more reliable than the other three.
Squared correlations with the parameter P(θ) are not presented because they differed
very  little  from  reliabilities. 

SIMULATION  WITH  MEDIUM-OF-ADMINISTRATION  EFFECT 

The  results  shown  in  tables  1  to  3  are  based  on  simulations  in  which  assumptions 
of item response theory were satisfied. Different results may be obtained when assumptions
are violated. In an operational CAT project, item parameters are estimated from
PP administration of the item pool and then used in CAT. This assumes that the parameters
are not affected by the medium of administration. This assumption is known to be
false. Using data from the recruit sample to which the experimental CAT-ASVAB was
administered, Divgi and Stoloff [8] found that observed P(θ) differed substantially from
those calculated from the PP-based item parameters. Therefore, a second simulation was
performed in which parameters were changed from PP to CAT for those items that had
been answered by at least 1,000 recruits and hence had CAT-based parameters estimated
by Divgi [9].

As  in  operational  CAT,  PP-based  item  parameters  (that  is,  those  used  in  the  first 
simulation)  were  used  for  item  selection  and  to  calculate  all  ability  estimates.  However, 
while  generating  examinees’  responses,  probabilities  of  correct  answers  were  calculated 
using parameter values based on CAT administration [9].
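
The design of this second simulation can be sketched in a much simplified form. Everything specific below is an assumption made for illustration: a fixed 15-item test replaces adaptive item selection, the "PP" and "CAT" parameter values are randomly invented, the posterior mode is found on a coarse grid, and the squared correlation with ability is used as a quick stand-in for the cubic-regression reliability.

```python
import math
import random


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


GRID = [-4.0 + 0.1 * i for i in range(81)]  # assumed grid for the mode search


def posterior_mode(items, responses):
    """Mode of the posterior under an N(0,1) prior, using the scoring parameters."""
    best_theta, best_logpost = GRID[0], -math.inf
    for t in GRID:
        logpost = -0.5 * t * t
        for (a, b, c), u in zip(items, responses):
            p = p_correct(t, a, b, c)
            logpost += math.log(p if u else 1.0 - p)
        if logpost > best_logpost:
            best_theta, best_logpost = t, logpost
    return best_theta


def squared_correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def simulate(scoring_items, generating_items, n=300, seed=7):
    """Generate responses with one parameter set and score with another."""
    rng = random.Random(seed)
    thetas, scores = [], []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        responses = [rng.random() < p_correct(theta, *item) for item in generating_items]
        thetas.append(theta)
        scores.append(posterior_mode(scoring_items, responses))
    return squared_correlation(thetas, scores)


rng = random.Random(0)
# Hypothetical PP-based parameters and perturbed "CAT-based" counterparts.
pp_items = [(rng.uniform(0.8, 1.8), rng.uniform(-2.0, 2.0), 0.2) for _ in range(15)]
cat_items = [(a * rng.uniform(0.7, 1.0), b + rng.gauss(0.0, 0.3), c) for a, b, c in pp_items]
print("no medium effect:  ", simulate(pp_items, pp_items))
print("with medium effect:", simulate(pp_items, cat_items))
```

The essential feature is the mismatch: scoring always uses the PP-based values, while the second run draws responses from different parameters. The size and even the direction of the difference in any single small run depends on the invented values and the random seed.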

Tables 4 and 5 contain results in θ and percent-correct metrics, which are similar
to those in tables 1 to 3 as far as comparisons among estimators are concerned. It is the
comparison between tables that is interesting. The effect of the medium of administration
reduces reliabilities. Changes in item parameters increase error variance and, in general,
also reduce true-score variance.

[TABLE 4: PROPERTIES OF ESTIMATORS IN PERCENT-CORRECT METRIC WITH N(0,1 ... (caption truncated; table not recovered)]

The  loss  of  reliability  is  presented  in  table  6  for  the  modal  estimate,  which  has 
been approved for use in CAT-ASVAB [3 (item 6)]. Following the CAT-ASVAB plans,
the  estimator  in  percent-correct  metric  is  P(mode).  The  first  line  in  table  6  quantifies  the 
size  of  the  medium  effect,  in  terms  of  mean  average  absolute  difference  (AAD)  between 
CAT  and  PP  item  characteristic  curves  [9  (table  2)]. 
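
One way to compute such a quantity is sketched below. The exact definition of AAD is given in [9] and is not reproduced in this memorandum, so the unweighted average over 37 equally spaced ability values from -2.25 to 2.25 is an assumption, as are the function names and the example parameters.

```python
import math


def icc(theta, a, b, c):
    """Item characteristic curve under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))


def average_absolute_difference(pp_item, cat_item, lo=-2.25, hi=2.25, n=37):
    """Unweighted mean of |ICC_PP - ICC_CAT| over an equally spaced grid (assumed)."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        total += abs(icc(theta, *pp_item) - icc(theta, *cat_item))
    return total / n


def mean_aad(pp_items, cat_items):
    """Average the per-item AAD over all items of a subtest."""
    pairs = list(zip(pp_items, cat_items))
    return sum(average_absolute_difference(p, c) for p, c in pairs) / len(pairs)


# Hypothetical (a, b, c) values for two items under each medium.
pp = [(1.2, 0.0, 0.2), (0.9, 0.5, 0.25)]
cat = [(1.0, 0.2, 0.2), (0.9, 0.8, 0.25)]
print(round(mean_aad(pp, cat), 3))
```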


TABLE 6

SIZE OF MEDIUM EFFECT AND CONSEQUENT DECREASES
IN RELIABILITY IN θ AND PERCENT-CORRECT METRICS

Subtest      GS    AR    WK    PC    AI    SI    MK    MC    EI

Mean AAD    .050  .047  .048  .051  .061  .079  .064  .089  .071
θ metric    .053  .048  .047  .050  .013  .035  .048  .061  .049

% metric (only eight of the nine values survive, so they are listed in order
without column alignment): .078 .048 .034 .056 .023 .048 .063 .046
Given that the occupational subtests AI, SI, MC, and EI are more sensitive to the
medium effect than the academic subtests, it is surprising that the degradation in
reliability is about the same for both types.


CONCLUSIONS 
It is clear that the smaller MSE of the posterior mean does not translate into
noticeably superior reliability. This supports the decision to use the modal estimate in
CAT-ASVAB [3]. In fact, except for its dependence on item order, even Owen’s estimate
would be satisfactory. Results obtained recently by Sympson [3] support the use of
Owen’s approximation for item selection.

The other major conclusion is that CAT reliability drops appreciably if the medium
of administration changes item parameters on a scale found in the experimental
CAT-ASVAB data [8, 9]. Therefore simulations without a medium effect overestimate
CAT reliability. More realistic simulations, allowing for changes in item parameters
from  PP  to  CAT,  cannot  be  performed  until  enough  CAT  data  are  in  hand  to  permit 
estimation  of  item  parameters.  In  the  meantime,  all  one  can  do  is  refrain  from  making 
strong  claims  about  the  reliability  of  the  CAT  version. 